Dr Ignaz Semmelweis was a Hungarian physician born in 1818 who worked in the Vienna General Hospital. In the past people thought of illness as caused by "bad air" or evil spirits. But in the 1800s Doctors started looking more at anatomy, doing autopsies and started making arguments based on data. Dr Semmelweis suspected that something was going wrong with the procedures at Vienna General Hospital. Semmelweis wanted to figure out why so many women in maternity wards were dying from childbed fever (i.e., puerperal fever).
Today you will become Dr Semmelweis. This is your office 👆. You will step into Dr Semmelweis' shoes and analyse the same data collected from 1841 to 1849.
Dr Semmelweis published his research in 1861. I found the scanned pages of the full text with the original tables in German, but an excellent English translation can be found here.
Google Colab may not be running the latest version of plotly. If you're working in Google Colab, uncomment the line below, run the cell, and restart your notebook server.
%pip install --upgrade plotly
Defaulting to user installation because normal site-packages is not writeable Requirement already satisfied: plotly in /home/mitresh/.local/lib/python3.10/site-packages (5.14.1) Requirement already satisfied: tenacity>=6.2.0 in /home/mitresh/.local/lib/python3.10/site-packages (from plotly) (8.2.2) Requirement already satisfied: packaging in /home/mitresh/.local/lib/python3.10/site-packages (from plotly) (23.1) Note: you may need to restart the kernel to use updated packages.
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
pd.options.display.float_format = '{:,.2f}'.format
# Create locators for ticks on the time axis
years = mdates.YearLocator()
months = mdates.MonthLocator()
years_fmt = mdates.DateFormatter('%Y')
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
df_yearly = pd.read_csv('annual_deaths_by_clinic.csv')
# parse_dates avoids DateTime conversion later
df_monthly = pd.read_csv('monthly_deaths.csv',
parse_dates=['date'])
Challenge: Check out these two DataFrames ☝️.
df_yearly.shape
(12, 4)
df_monthly.shape
(98, 3)
df_yearly.head()
| year | births | deaths | clinic | |
|---|---|---|---|---|
| 0 | 1841 | 3036 | 237 | clinic 1 |
| 1 | 1842 | 3287 | 518 | clinic 1 |
| 2 | 1843 | 3060 | 274 | clinic 1 |
| 3 | 1844 | 3157 | 260 | clinic 1 |
| 4 | 1845 | 3492 | 241 | clinic 1 |
df_yearly.tail()
| year | births | deaths | clinic | |
|---|---|---|---|---|
| 7 | 1842 | 2659 | 202 | clinic 2 |
| 8 | 1843 | 2739 | 164 | clinic 2 |
| 9 | 1844 | 2956 | 68 | clinic 2 |
| 10 | 1845 | 3241 | 66 | clinic 2 |
| 11 | 1846 | 3754 | 105 | clinic 2 |
df_yearly.isna().values.any()
False
df_yearly.duplicated().values.any()
False
df_monthly.isna().values.any()
False
df_monthly.duplicated().values.any()
False
df_monthly.describe()
| date | births | deaths | |
|---|---|---|---|
| count | 98 | 98.00 | 98.00 |
| mean | 1845-02-11 04:24:29.387755008 | 267.00 | 22.47 |
| min | 1841-01-01 00:00:00 | 190.00 | 0.00 |
| 25% | 1843-02-08 00:00:00 | 242.50 | 8.00 |
| 50% | 1845-02-15 00:00:00 | 264.00 | 16.50 |
| 75% | 1847-02-22 00:00:00 | 292.75 | 36.75 |
| max | 1849-03-01 00:00:00 | 406.00 | 75.00 |
| std | NaN | 41.77 | 18.14 |
df_yearly.describe()
| year | births | deaths | |
|---|---|---|---|
| count | 12.00 | 12.00 | 12.00 |
| mean | 1,843.50 | 3,152.75 | 223.33 |
| std | 1.78 | 449.08 | 145.38 |
| min | 1,841.00 | 2,442.00 | 66.00 |
| 25% | 1,842.00 | 2,901.75 | 100.25 |
| 50% | 1,843.50 | 3,108.50 | 219.50 |
| 75% | 1,845.00 | 3,338.25 | 263.50 |
| max | 1,846.00 | 4,010.00 | 518.00 |
Challenge: How dangerous was childbirth in the 1840s in Vienna?
In comparison, the United States recorded 18.5 maternal deaths per 100,000 or 0.018% in 2013 (source).
womens_death_childbirth = df_yearly.deaths.sum() / df_yearly.births.sum() * 100
print(f"Percentage of Women dying from child birth: {womens_death_childbirth:.3}")
Percentage of Women dying from child birth: 7.08
Challenge: Create a Matplotlib chart with twin y-axes. It should look something like this:
skyblue and crimson for the line coloursplt.figure(figsize=(14,8), dpi=200)
plt.title('Total Number of Monthly Births and Deaths', fontsize=18)
ax1 = plt.gca()
ax2 = ax1.twinx()
ax1.grid(color='grey', linestyle='--')
ax1.plot(df_monthly.date,
df_monthly.births,
color='skyblue',
linewidth=3)
ax2.plot(df_monthly.date,
df_monthly.deaths,
color='red',
linewidth=2,
linestyle='--')
plt.show()
plt.figure(figsize=(14,8), dpi=200)
plt.title('Total Number of Monthly Births and Deaths', fontsize=18)
plt.yticks(fontsize=14)
plt.xticks(fontsize=14, rotation=45)
ax1 = plt.gca()
ax2 = ax1.twinx()
ax1.set_ylabel('Births', color='skyblue', fontsize=18)
ax2.set_ylabel('Deaths', color='crimson', fontsize=18)
# Use Locators
ax1.set_xlim([df_monthly.date.min(), df_monthly.date.max()])
ax1.xaxis.set_major_locator(years)
ax1.xaxis.set_major_formatter(years_fmt)
ax1.xaxis.set_minor_locator(months)
ax1.grid(color='grey', linestyle='--')
ax1.plot(df_monthly.date,
df_monthly.births,
color='skyblue',
linewidth=3)
ax2.plot(df_monthly.date,
df_monthly.deaths,
color='red',
linewidth=2,
linestyle='--')
plt.show()
Now let's look at the annual data instead.
Challenge: Use plotly to create line charts of the births and deaths of the two different clinics at the Vienna General Hospital.
line = px.line(df_yearly,
x='year',
y='births',
color='clinic',
title='Total Yearly Births by Clinic')
line.show()
line = px.line(df_yearly,
x='year',
y='deaths',
color='clinic',
title='Total Yearly Deaths by Clinic')
line.show()
Challenge: Calculate the proportion of maternal deaths per clinic. That way we can compare like with like.
df_yearly DataFrame by adding a column called "pct_deaths".df_yearly['pct_deaths'] = df_yearly.deaths / df_yearly.births
clinic_1 = df_yearly[df_yearly.clinic == 'clinic 1']
clinic_1.deaths.sum() / clinic_1.births.sum() * 100
9.924159265542361
clinic_2 = df_yearly[df_yearly.clinic == 'clinic 2']
clinic_2.deaths.sum() / clinic_2.births.sum() * 100
3.8839862852003826
line = px.line(df_yearly,
x='year',
y='pct_deaths',
color='clinic',
title='Proportion of Yearly Deaths by Clinic')
line.show()
The story continues...
At first, Dr Semmelweis thought that the position of the women giving birth was the issue. In clinic 2, the midwives' clinic, women gave birth on their sides. In the doctors' clinic, women gave birth on their backs. So, Dr. Semmelweis, had women in the doctors' clinic give birth on their sides. However, this had no effect on the death rate.
Next, Dr Semmelweis noticed that whenever someone on the ward died, a priest would walk through clinic 1, past the women's beds ringing a bell 🔔. Perhaps the priest and the bell ringing terrified the women so much after birth that they developed a fever, got sick and died. Dr Semmelweis had the priest change his route and stop ringing the bell 🔕. Again, this had no effect.
At this point, Dr Semmelweis was so frustrated he went on holiday to Venice. Perhaps a short break would clear his head. When Semmelweis returned from his vacation, he was told that one of his colleagues, a pathologist, had fallen ill and died. His friend had pricked his finger while doing an autopsy on a woman who had died from childbed fever and subsequently got very sick himself and died. 😮
Looking at the pathologist's symptoms, Semmelweis realised the pathologist died from the same thing as the women he had autopsied. This was his breakthrough: anyone could get sick from childbed fever, not just women giving birth!
This is what led to Semmelweis' new theory. Perhaps there were little pieces or particles of a corpse that the doctors and medical students were getting on their hands while dissecting the cadavers during an autopsy. And when the doctors delivered the babies in clinic 1, these particles would get inside the women giving birth who would then develop the disease and die.
Dr Semmelweis made handwashing obligatory in the summer of 1947. In fact, he ordered people to wash their hands with clorine (instead of water).
# Date when handwashing was made mandatory
handwashing_start = pd.to_datetime('1847-06-01')
Challenge:
df_monthly that has the percentage of deaths per birth for each row.df_monthly data: before and after Dr Semmelweis ordered washing hand.df_monthly['pct_deaths'] = df_monthly.deaths / df_monthly.births
df_monthly.head()
| date | births | deaths | pct_deaths | |
|---|---|---|---|---|
| 0 | 1841-01-01 | 254 | 37 | 0.15 |
| 1 | 1841-02-01 | 239 | 18 | 0.08 |
| 2 | 1841-03-01 | 277 | 12 | 0.04 |
| 3 | 1841-04-01 | 255 | 4 | 0.02 |
| 4 | 1841-05-01 | 255 | 2 | 0.01 |
before_washing = df_monthly[df_monthly.date < handwashing_start]
after_washing = df_monthly[df_monthly.date >= handwashing_start]
bw_rate = before_washing.deaths.sum() / before_washing.births.sum() * 100
aw_rate = after_washing.deaths.sum() / after_washing.births.sum() * 100
print(f'Average death rate before 1847 was {bw_rate:.4}%')
print(f'Average death rate AFTER 1847 was {aw_rate:.3}%')
Average death rate before 1847 was 10.53% Average death rate AFTER 1847 was 2.15%
Challenge: Create a DataFrame that has the 6 month rolling average death rate prior to mandatory handwashing.
Hint: You'll need to set the dates as the index in order to avoid the date column being dropped during the calculation.
roll_df = before_washing.set_index('date')
roll_df = roll_df.rolling(window=6).mean()
roll_df
| births | deaths | pct_deaths | |
|---|---|---|---|
| date | |||
| 1841-01-01 | NaN | NaN | NaN |
| 1841-02-01 | NaN | NaN | NaN |
| 1841-03-01 | NaN | NaN | NaN |
| 1841-04-01 | NaN | NaN | NaN |
| 1841-05-01 | NaN | NaN | NaN |
| ... | ... | ... | ... |
| 1847-01-01 | 264.47 | 34.97 | 0.14 |
| 1847-02-01 | 268.36 | 32.33 | 0.12 |
| 1847-03-01 | 274.31 | 29.17 | 0.11 |
| 1847-04-01 | 281.83 | 26.81 | 0.10 |
| 1847-05-01 | 289.50 | 24.81 | 0.09 |
76 rows × 3 columns
Challenge: Copy-paste and then modify the Matplotlib chart from before to plot the monthly death rates (instead of the total number of births and deaths). The chart should look something like this:
plt.figure(figsize=(14,8), dpi=200)
plt.title('Percentage of Monthly Deaths over Time', fontsize=18)
plt.yticks(fontsize=14)
plt.xticks(fontsize=14, rotation=45)
plt.ylabel('Percentage of Deaths', color='crimson', fontsize=18)
ax = plt.gca()
ax.xaxis.set_major_locator(years)
ax.xaxis.set_major_formatter(years_fmt)
ax.xaxis.set_minor_locator(months)
ax.set_xlim([df_monthly.date.min(), df_monthly.date.max()])
plt.grid(color='grey', linestyle='--')
ma_line, = plt.plot(roll_df.index,
roll_df.pct_deaths,
color='crimson',
linewidth=3,
linestyle='--',
label='6m Moving Average')
bw_line, = plt.plot(before_washing.date,
before_washing.pct_deaths,
color='black',
linewidth=1,
linestyle='--',
label='Before Handwashing')
aw_line, = plt.plot(after_washing.date,
after_washing.pct_deaths,
color='skyblue',
linewidth=3,
marker='o',
label='After Handwashing')
plt.legend(handles=[ma_line, bw_line, aw_line],
fontsize=18)
plt.show()
Challenge:
average_deaths_bhw = before_washing.pct_deaths.mean() * 100
average_deaths_ahw = after_washing.pct_deaths.mean() * 100
mean_monthly = average_deaths_bhw - average_deaths_ahw
times = average_deaths_bhw / average_deaths_ahw
print(f'Chance of death during childbirth before handwashing: {average_deaths_bhw:.3}%.')
print(f'Chance of death during childbirth AFTER handwashing: {average_deaths_ahw:.3}%.')
print(f'Handwashing reduced the monthly proportion of deaths by {mean_monthly:.3}%!')
print(f'This is a {times:.2}x improvement!')
Chance of death during childbirth before handwashing: 10.5%. Chance of death during childbirth AFTER handwashing: 2.11%. Handwashing reduced the monthly proportion of deaths by 8.4%! This is a 5.0x improvement!
Challenge:
.where() function to add a column to df_monthly that shows if a particular date was before or after the start of handwashing.df_monthly['washing_hands'] = np.where(df_monthly.date < handwashing_start, 'False', 'True')
df_monthly
| date | births | deaths | pct_deaths | washing_hands | |
|---|---|---|---|---|---|
| 0 | 1841-01-01 | 254 | 37 | 0.15 | False |
| 1 | 1841-02-01 | 239 | 18 | 0.08 | False |
| 2 | 1841-03-01 | 277 | 12 | 0.04 | False |
| 3 | 1841-04-01 | 255 | 4 | 0.02 | False |
| 4 | 1841-05-01 | 255 | 2 | 0.01 | False |
| ... | ... | ... | ... | ... | ... |
| 93 | 1848-11-01 | 310 | 9 | 0.03 | True |
| 94 | 1848-12-01 | 373 | 5 | 0.01 | True |
| 95 | 1849-01-01 | 403 | 9 | 0.02 | True |
| 96 | 1849-02-01 | 389 | 12 | 0.03 | True |
| 97 | 1849-03-01 | 406 | 20 | 0.05 | True |
98 rows × 5 columns
box = px.box(df_monthly,
x='washing_hands',
y='pct_deaths',
color='washing_hands',
title='How Have the Stats Changed with Handwashing?')
box.update_layout(xaxis_title='Washing Hands?',
yaxis_title='Percentage of Monthly Deaths',)
box.show()
Challenge: Create a plotly histogram to show the monthly percentage of deaths.
color parameter to display two overlapping histograms.histnorm to percent to make the time periods comparable.marginal parameter.hist = px.histogram(df_monthly,
x='pct_deaths',
color='washing_hands',
nbins=30,
opacity=0.6,
barmode='overlay',
histnorm='percent',
marginal='box')
hist.show()
hist.update_layout(xaxis_title='Proportion of Monthly Deaths',
yaxis_title='Count',)
hist.show()
Challenge: Use Seaborn's .kdeplot() to create two kernel density estimates of the pct_deaths, one for before handwashing and one for after.
shade/fill parameter to give your two distributions different colours.clip parameter to address the problem.plt.figure(dpi=200)
sns.kdeplot(before_washing.pct_deaths, fill=True, clip=(0,1))
sns.kdeplot(after_washing.pct_deaths, fill=True, clip=(0,1))
plt.title('Est. Distribution of Monthly Death Rate Before and After Handwashing')
plt.xlim(0, 0.40)
plt.show()
Challenge: Use a t-test to determine if the differences in the means are statistically significant or purely due to chance.
If the p-value is less than 1% then we can be 99% certain that handwashing has made a difference to the average monthly death rate.
stats from scipy.ttest_ind() function to calculate the t-statistic and the p-valueimport scipy.stats as stats
t_stat, p_value = stats.ttest_ind(a=before_washing.pct_deaths,
b=after_washing.pct_deaths)
print(f'p-palue is {p_value:.10f}')
print(f't-statstic is {t_stat:.4}')
p-palue is 0.0000002985 t-statstic is 5.512
What do you conclude from your analysis, Doctor? 😊